Create sibling datasets

The script below demonstrates how to build a dataset of siblings using information about parity (a person's number in the birth order for a given mother). Because the procedure involves many repetitive operations, a loop is used to keep the script short. You can read more about loops here.

The procedure splits the total population by parity, converts each subset from person level to mother level, and then links all the datasets through an identifier that points to the mother.
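The three steps (split by parity, re-key by mother, link on the mother identifier) can be illustrated outside microdata.no with a minimal Python sketch. The records, field layout, and the function name `build_mother_table` are hypothetical and only serve to show the shape of the result; the sketch also assumes at most one child per mother and parity, which the real data does not guarantee.

```python
from collections import defaultdict

# Hypothetical person-level records: (person_id, mother_id, parity, gender)
persons = [
    ("p1", "m1", 1, 1),
    ("p2", "m1", 2, 2),
    ("p3", "m2", 1, 2),
    ("p4", "m1", 3, 1),
]

MAX_PARITY = 3  # the script uses 9; kept small for the illustration


def build_mother_table(persons, max_parity):
    """Return {mother_id: {"gender1": ..., "gender2": ...}}, one field per parity."""
    mothers = defaultdict(dict)
    for parity in range(1, max_parity + 1):
        # Step 1: split the population by parity
        subset = [p for p in persons if p[2] == parity]
        # Steps 2 and 3: re-key each child by mother id and attach the value
        # to the mother's row under a parity-suffixed name
        for _person_id, mother_id, _parity, gender in subset:
            mothers[mother_id][f"gender{parity}"] = gender
    return dict(mothers)


mother_table = build_mother_table(persons, MAX_PARITY)
print(mother_table)
```

Each mother ends up as one row with parity-suffixed fields, which mirrors the `gender1`, `gender2`, ... variables that the script creates.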

The resulting dataset has mothers as its unit and holds data on up to 9 siblings. The cap at 9 exists because so few people belong to sibling groups larger than 9 that the corresponding datasets would fall below the minimum dataset size of 1000.

In the example, the script retrieves each sibling's municipality of residence at age 16, gender, and age as of 2023. Other information, such as income, education, employment, or social security status, can be used as well.

require no.ssb.fdb:30 as db

// Check parity (number in birth order) for the population and then create dataset of mothers
create-dataset mothers
import db/BEFOLKNING_PARITET as parity
import db/BEFOLKNING_MOR_FNR as motherid

textblock
Total population distributed by parity (number in birth order):
endblock
tabulate parity

collapse(count) parity -> num_children, by(motherid)

textblock
Total population of mothers distributed by number of children:
endblock
tabulate num_children, missing

// Create separate datasets based on parity, change unit type to mother's id, and merge all with mother dataset
for i in 1:9  
  let var1 = municipality_16year ++ $i
  let var2 = gender ++ $i
  let var3 = age ++ $i
  let n = n ++ $i
  
  create-dataset parity ++ $i
  import db/BEFOLKNING_PARITET as parity
  keep if parity == $i

  import db/BEFOLKNING_MOR_FNR as motherid
  import db/NUDB_KOMM_16 as $var1 
  import db/BEFOLKNING_KJOENN as $var2
  import db/BEFOLKNING_FOEDSELS_AAR_MND as birthdate
  generate $var3 = 2023 - int(birthdate/100)
  
  destring $var1 $var2
  drop parity birthdate

  generate $n = 1
  collapse (count) $n (mean) $var1 $var2 $var3, by(motherid)
  tabulate $n, freq cellpct
  merge $n $var1 $var2 $var3 into mothers
end

// The mother dataset now holds data for up to 9 children (siblings) per mother, i.e., the first 9 born. Children later in the birth order are not included because they are so few that their datasets would fall below the minimum dataset size of 1000.

// Also note that some parities contain duplicates, cf. the tables for n1, n2, n3, ..., n9 where the value is greater than 1 (each parity should occur only once per mother). This is due to miscoding of parity.

use mothers
generate num_siblings = rowvalid(n1, n2, n3, n4, n5, n6, n7, n8, n9)

textblock
Statistics on the number of siblings in all sibling groups. Only sibling groups of up to 9 children are counted.
Note: The statistics are based on parity data.
endblock
tabulate num_siblings

textblock
Distribution by gender and number of children per parity for parities 1, 2 and 3. A small share of duplicates shows up as n1, n2, n3 > 1. The tables also show how the duplicates are distributed by gender (male = 1, female = 2; non-integer values indicate duplicates of different genders):
endblock
tabulate gender1 n1
tabulate gender2 n2
tabulate gender3 n3

textblock
Average age for parity no. 1, 2 and 3:
endblock
summarize age1 age2 age3

textblock
Discrete histograms showing the age composition for parities 1 and 2 in cases where there are duplicates. For parity 1, non-integer ages occur, which indicates that some of the duplicates differ in age. For parity 2, only integer ages occur, which suggests that the duplicates are of the same age:
endblock
histogram age1 if n1 > 1, discrete
histogram age2 if n2 > 1, discrete

textblock
Before further analysis, you should decide how to handle the duplicates (which are due to miscoding). You could, for example, remove the duplicate occurrences or keep one of them according to certain criteria. One of several possible sources of error is people whose person id type changed from DNR to FNR, so that they appear as two people in the dataset. This can be checked by adding the variable `BEFOLKNING_MRK_FNR` or a variable showing country of birth or immigrant status.
endblock
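To make the duplicate issue concrete, the following Python sketch (outside microdata.no, with made-up records) shows why a count/mean collapse produces non-integer gender and age values when one mother has two children registered with the same parity, and how one record per mother and parity could be selected. The rule used here, keeping the oldest child, is purely an assumption for illustration and not a recommendation from the source.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child records: (mother_id, parity, gender, age)
children = [
    ("m1", 1, 1, 30),
    ("m1", 1, 2, 31),  # second child registered with parity 1 for mother m1
    ("m1", 2, 2, 28),
    ("m2", 1, 2, 40),
]

# Group the children by (mother, parity)
groups = defaultdict(list)
for mother_id, parity, gender, age in children:
    groups[(mother_id, parity)].append((gender, age))

# What a count/mean collapse yields per (mother, parity)
collapsed = {
    key: {
        "n": len(rows),
        "gender": mean(g for g, _ in rows),
        "age": mean(a for _, a in rows),
    }
    for key, rows in groups.items()
}

# Keys with n > 1 are the duplicates
duplicates = sorted(key for key, stats in collapsed.items() if stats["n"] > 1)

# One possible rule (an assumption): keep the oldest child in each group
deduplicated = {key: max(rows, key=lambda r: r[1]) for key, rows in groups.items()}

print(collapsed[("m1", 1)])
print(duplicates)
```

In this toy data the duplicated parity gets a mean gender of 1.5 and a mean age of 30.5, the same kind of non-integer values that appear in the tables and histograms above.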